In an ideal world, we have for each class \(c = 1,...,C\) the probability that \(x\) belongs to the \(c\)-th class. For example, if we are classifying fruits, \(K_1\) could be “Banana”, \(K_2\) could be “Apple”, etc. When we write \(\mathbb{P}(K_c|x)\), we are asking: “Given that we observed \(x\), what is the probability that it belongs to class \(c\)?” If we denote the event of belonging to class \(c\) by \(K_c\) and we know the conditional probabilities \(\mathbb{P}(x|K_c)\) (the class-conditional distributions), e.g. \(\mathbb{P}(1003\text{g}, \text{red} \mid \text{Banana}) = 0\), then we can use Bayes’ theorem to obtain \[\mathbb{P}(K_c|x) = \frac{\mathbb{P}(K_c)\mathbb{P}(x|K_c)}{\mathbb{P}(x)} = \frac{\mathbb{P}(K_c)\mathbb{P}(x|K_c)}{\sum_{i=1}^C \mathbb{P}(K_i)\mathbb{P}(x|K_i)}\]
The probability \(\mathbb{P}(K_c)\) of the \(c\)-th class is usually abbreviated as \(\pi_c\) (the so-called class prior), leading to
\[\mathbb{P}(K_c|x) = \frac{\pi_c\mathbb{P}(x|K_c)}{\sum_{i=1}^C \pi_i\mathbb{P}(x|K_i)}\] Classification is then done by choosing the most probable class, i.e. \[ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{c=1}^C \frac{\pi_c\mathbb{P}(x|K_c)}{\sum_{i=1}^C \pi_i\mathbb{P}(x|K_i)} = \argmax_{c=1}^C \hspace{0.1cm} \pi_c\mathbb{P}(x |K_c) \] (Since the denominator is the same for all classes, we can just compute the numerator!) This means:
Imagine we want to classify a fruit \(x\) based on weight and color. Suppose we have
If we observe \(x = (1003\text{g}, \text{red})\), the probability of observing that for each fruit is
The mapping \(x \mapsto \argmax_{c=1}^C \pi_c \mathbb{P}(x|K_c)\) is also called the Bayes classifier. This means that, given some feature \(x\), we assign it to the class \(K_c\) that maximizes \(\pi_c \mathbb{P}(x|K_c)\). Note that this is independent of the denominator in #2.2.9. Often, \(x\) is a continuous random variable, making it meaningless to work with \(\mathbb{P}(x|K_c)\): if \(x|K_c\) is, for example, normally distributed, then \(\mathbb{P}(x|K_c) = 0\) for all \(x \in \mathbb{R}^d\). In this case, however, we can use Bayes’ theorem for densities and obtain
\[\mathbb{P}(K_c|x) = \frac{\pi_c p(x|K_c)}{\sum_{i=1}^C \pi_i p(x|K_i)}\] where \(p(x|K_c)\) are the conditional densities (\(p(x|K_c)\) is now a PDF of \(x\) given class \(K_c\)). Note that the left side still represents a probability and not a density, since the possible classes form a discrete and finite probability space! In practice, we unfortunately have neither \(\pi_c\) nor \(\mathbb{P}(x|K_c)\) (or \(p(x|K_c)\)) available. Both quantities must be approximated using the data. For the priors, one can use the class proportions: \[\hat{\pi}_c := \frac{\#\{i = 1,...,N \hspace{0.1cm} | \hspace{0.1cm} y_i = l_c\}}{N}\] This just means: \(\hat{\pi}_c\) is simply the fraction of training samples whose label is \(l_c\).
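For concreteness, here is a small Python sketch of how such class proportions could be computed from training labels; the helper name and the labels are hypothetical and not from the text.

```python
import numpy as np

def estimate_priors(y, labels):
    """Estimate pi_hat_c as the fraction of training points with label l_c."""
    y = np.asarray(y)
    return np.array([np.mean(y == l) for l in labels])

# Hypothetical labels y_1, ..., y_N:
y = ["apple", "apple", "banana", "apple", "banana"]
print(estimate_priors(y, ["banana", "apple"]))  # [0.4 0.6]
```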
Additionally, the conditional probabilities \(\mathbb{P}(x|K_c)\) or densities \(p(x|K_c)\) must be suitably approximated by \(\hat{P}(x|K_c)\) or \(\hat{p}(x|K_c)\):

- \(\hat{P}(x|K_c)\) is an approximation of the probability \(\mathbb{P}(x|K_c)\) when \(x\) is discrete (e.g. categorical features)
- \(\hat{p}(x|K_c)\) is an approximation of the probability density function \(p(x|K_c)\) when \(x\) is continuous (e.g. real-valued features)
We then obtain the approximate Bayes classifier: \[x \mapsto \argmax_{c=1}^C \hspace{0.1cm}\hat{\pi}_c\hat{P}(x|K_c) \hspace{0.5cm} \text{or} \hspace{0.5cm} x \mapsto \argmax_{c=1}^C\hspace{0.1cm} \hat{\pi}_c \hat{p}(x|K_c)\] This means: we simply plug the estimated priors and estimated conditionals into the Bayes classifier in place of the true quantities.
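As a rough sketch of this plug-in rule (function names and numbers below are hypothetical, not from the text), the classifier is just an argmax over the estimated class scores \(\hat{\pi}_c \hat{P}(x|K_c)\) or \(\hat{\pi}_c \hat{p}(x|K_c)\).

```python
import numpy as np

def approximate_bayes_classifier(x, priors_hat, cond_hat):
    """Return the index c maximizing pi_hat_c * P_hat(x | K_c) (or pi_hat_c * p_hat(x | K_c)).

    priors_hat -- array of estimated class priors
    cond_hat   -- list of callables, cond_hat[c](x) = estimated conditional of x given class c
    """
    scores = np.array([pi * f(x) for pi, f in zip(priors_hat, cond_hat)])
    # The denominator is the same for every class, so only the numerators are compared.
    return int(np.argmax(scores))

# Hypothetical two-class example (all numbers made up for illustration):
priors_hat = np.array([0.4, 0.6])
cond_hat = [lambda x: 0.0, lambda x: 0.01]
print(approximate_bayes_classifier((1003, "red"), priors_hat, cond_hat))  # -> 1
```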
If \(x |K_c\) is a discrete random variable (e.g. colors red, green, yellow), one can approximate \(\mathbb{P}(x|K_c)\) by proportions in the training data, e.g.
\[\hat{P}(\text{red}|K_c) := \frac{\#\{i |x_i = \text{red}, y_i = l_c\}}{\# \{i | y_i = l_c\}}\] This means:
There are 3 red apples and 4 apples in total, so \[\hat{P}(\text{red}|K_{\text{apple}}) = \frac{3}{4} = 0.75,\] i.e. 75% of the apples in our training data are red.
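A small Python sketch of this counting estimate; the arrays below are hypothetical, but chosen so that the apple counts match the example above (3 red apples out of 4).

```python
import numpy as np

def estimate_conditional(x, y, value, label):
    """Estimate P(x = value | K_label) by counting inside class `label`."""
    x, y = np.asarray(x), np.asarray(y)
    in_class = (y == label)
    return np.sum(x[in_class] == value) / np.sum(in_class)

# Hypothetical training data (colors and class labels):
colors = ["red", "red", "green", "red", "yellow", "yellow"]
labels = ["apple", "apple", "apple", "apple", "banana", "banana"]
print(estimate_conditional(colors, labels, "red", "apple"))  # 3/4 = 0.75
```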
For continuous variables, a common assumption is that \(x|K_c \sim \mathcal{N}(\mu_c, \sigma_c^2)\) is normally distributed with mean \(\mu_c\) and variance \(\sigma_c^2\) for \(c=1,...,C\). For simplicity, we only consider the one-dimensional case here. We can approximate the densities \(p(x|K_c)\) for \(x \in \mathbb{R}\) by \[\hat{p}(x|K_c) := \frac{1}{\sqrt{2\pi \hat{\sigma}_c^2}} \text{exp}\biggl(-\frac{(x-\hat{\mu}_c)^2}{2\hat{\sigma}_c^2}\biggr)\] for \(c = 1,...,C\), where \[\hat{\mu}_c := \frac{1}{\#\{i | y_i = l_c\}}\sum_{i |y_i = l_c}x_i \hspace{1cm} \hat{\sigma}_c^2 := \frac{1}{\#\{i | y_i = l_c\}} \sum_{i|y_i=l_c}(x_i-\hat{\mu}_c)^2 \] are the empirical mean and variance of the points with label \(l_c\):

- \(\hat{\mu}_c\) is just the average of all feature values \(x_i\) that belong to class \(c\): the numerator sums all \(x_i\) values for class \(c\), and the denominator is the number of training samples in class \(c\)
- \(\hat{\sigma}_c^2\) is the variance estimate, which measures the spread of the values around the mean: each sample’s squared deviation from the mean is summed and then divided by the number of training points in class \(c\)
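The per-class Gaussian fit can be sketched in a few lines of Python; the function names and the weight/label data below are hypothetical and only meant to mirror the formulas above.

```python
import numpy as np

def fit_gaussian(x, y, label):
    """Estimate (mu_hat_c, sigma_hat_c^2) from the training points with label l_c."""
    x_c = np.asarray(x)[np.asarray(y) == label]
    return x_c.mean(), x_c.var()   # np.var divides by N, matching the formula above

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical 1-D weights (in grams) and labels:
weights = np.array([150.0, 160.0, 140.0, 155.0, 118.0, 122.0, 120.0])
labels  = np.array(["apple"] * 4 + ["banana"] * 3)
mu_a, var_a = fit_gaussian(weights, labels, "apple")
print(gaussian_pdf(145.0, mu_a, var_a))   # estimated density p_hat(145g | apple)
```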
If the distribution of \(x|K_c\) is more complicated, the Gaussian model above may not be a good choice. In this case, one can approximate the density using a so-called kernel density estimator. We again restrict ourselves to the case \(d=1\), i.e. one-dimensional inputs. We define
\[\hat{p}(x|K_c) := \frac{1}{\#\{i|y_i = l_c\}} \sum_{i|y_i=l_c} K_\lambda (x-x_i)\] where \(\lambda > 0\) is the bandwidth, \(K_\lambda(x) := \frac{1}{\lambda}K(\frac{x}{\lambda})\), and \(K : \mathbb{R} \to \mathbb{R}\) is a non-negative function with \(\int_\mathbb{R} K(x)dx = 1\). Possible choices are the Gaussian kernel \[K(x) := (2\pi)^{-1/2}\text{exp}(-x^2/2)\] or the uniform kernel \(K(x) := \frac{1}{2}\mathbb{1}_{|x| \leq 1}\). Note, however, that for the second choice it may happen that \(\hat{p}(x|K_c) = 0\) for all \(c = 1,...,C\), which does not allow for a class assignment. This approach depends on the choice of the bandwidth \(\lambda\). In particular, for \(0 \lt \lambda \ll 1\), \(\hat{p}(x|K_c) \approx 0\) whenever \(x\) is not a training data point \(x_i\). On the other hand, if \(\lambda\) is very large, \(\hat{p}(\cdot|K_c)\) is nearly constant. It can be shown that an optimal choice of \(\lambda\) for large \(n \in \mathbb{N}\) takes the form \(\lambda = Cn^{-\frac{1}{5}}\). For normally distributed data \(x_i \sim \mathcal{N}(\mu, \sigma^2)\), the optimal constant is given by \(C = (\frac{4}{3})^{\frac{1}{5}}\sigma\).
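As a sketch (helper names and data points are hypothetical), the kernel density estimate with a Gaussian kernel and the bandwidth rule \(\lambda = (\frac{4}{3})^{1/5}\sigma n^{-1/5}\) could look like this in Python.

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = (2*pi)^(-1/2) * exp(-u^2 / 2)."""
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(x, samples, lam):
    """Kernel density estimate (1/n) * sum_i K_lambda(x - x_i), with K_lambda(u) = K(u / lambda) / lambda."""
    samples = np.asarray(samples)
    return np.mean(gaussian_kernel((x - samples) / lam)) / lam

# Hypothetical training points of class K_c and bandwidth lambda = (4/3)^(1/5) * sigma * n^(-1/5):
x_c = np.array([118.0, 122.0, 120.0, 119.0, 125.0])
lam = (4 / 3) ** 0.2 * x_c.std() * len(x_c) ** (-0.2)
print(kde(121.0, x_c, lam))   # estimated p_hat(121 | K_c)
```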
Next up: 2.2.3.2 Naive Bayes Classification